As usual, we first check that our required packages are installed and then load them.
# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')
library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)
This week we will be learning a lot about the sampling distribution of the mean.
In last week's lab you were introduced to a dataset looking at social media use in young adults. That data comes from a research programme run here at UNSW. However, this experiment has been repeated at 500 universities across the world, to get an in-depth global understanding of social media use in young adults. Today you are going to look at your dataset from the last computing lab, alongside the data from all 500 universities.
We are now going to read the dataset that we need for this week into
R. The dataset can be read into an object called
global_social_media. Please use the read.csv()
and here() functions to read in the
PSYC2001_global-time-on-social-data.csv file in the code
block below.
#Use the read.csv() and here() functions to read in the dataset.
global_social_media <- read.csv(file = here("Data","PSYC2001_global-time-on-social-data.csv")) #your code goes here
Now, let's check whether the UNSW value (reminder: this is
U49 in the README file!) matches the mean value we calculated
last week.
This should be pretty easy to do, and makes use of the
filter() function we used last week. This function keeps only
the rows in your dataset that match a given condition.
global_social_media %>%
filter(uni_id == "U49") # reminder filter is used to select rows based on given conditions
## uni_id mean_time_on_social
## 1 U49 2.54
Yay! The output should match the mean we calculated last week. But what does this mean? Why have we bothered to show you this?
Each value in the new data file is the mean value for the time_on_social variable, for each of the 500 experiments run (U1-U500), i.e. the data contains 500 samples of the mean for time_on_social. This dataset is the result of repeating a single experiment, many times.
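To make this idea concrete, here is a small simulated sketch (using made-up numbers, not the real data): each of 500 "experiments" draws its own sample and reports a mean, and collecting those 500 means gives us a sampling distribution of the mean, just like the one in our data file.

```r
set.seed(123) # for reproducibility

# Simulate 500 experiments, each recording the mean of 100 observations
# (the mean of 2.5 and sd of 1 here are made-up values for illustration)
simulated_means <- replicate(500, mean(rnorm(n = 100, mean = 2.5, sd = 1)))

length(simulated_means) # one mean per simulated experiment: 500
```

Each element of simulated_means plays the same role as one row of our global_social_media dataset: a single sample mean from a single repetition of the experiment.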
Can you find the value for the University of Sydney (reminder: this is
'U102' in the README file!) using the filter() function?
global_social_media %>%
filter(uni_id == "U102")
## uni_id mean_time_on_social
## 1 U102 2.68
We are now going to have a look at what happens to the sampling distribution of the mean as we increase the number of sample means we include in our sample (confusing, I know!).
To do this we are going to make use of the sample()
function from base R. Let's first have a look at what this
function does by using the ? syntax. The help page should have
already opened when you first knitted the document.
?sample
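For instance, here is a quick illustrative call on a simple vector (the numbers are just an example):

```r
# sample() draws elements at random from a vector;
# by default it samples without replacement
three_values <- sample(1:10, size = 3)

three_values # three distinct values drawn from 1 to 10
```

Because the draw is random, your three values will likely differ each time you run this (unless you fix the seed with set.seed()).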
This function takes a vector as its first argument. This means we cannot just give it the entire dataframe: sample() will try to sample columns rather than rows, and since our dataframe only has two columns, asking for 10 results in an error.
sample(global_social_media, size = 10)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
But you may now be asking: what is a vector? You can think of a vector as a single column of our dataframe. In the instance above, since the function only wants a single column's worth of values, it gets overwhelmed when we pass it the entire dataframe. Poor function!
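One way around this (sketched here with made-up numbers standing in for the real file) is to pull a single column out of the dataframe as a vector using $, and then pass that vector to sample():

```r
# Toy data frame standing in for global_social_media
toy_data <- data.frame(uni_id              = paste0("U", 1:500),
                       mean_time_on_social = runif(500, min = 1, max = 4))

# $ extracts one column as a vector, which sample() is happy to work with
ten_means <- sample(toy_data$mean_time_on_social, size = 10)

ten_means # 10 randomly chosen sample means
```

The same pattern should work on the real dataset, e.g. sampling from global_social_media$mean_time_on_social instead of the toy column.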